{ "cells": [ { "cell_type": "markdown", "id": "1e468c88-42c6-4ef9-9189-3d04cdef8495", "metadata": {}, "source": [ "# Dirichlet Distribution\n", "\n", "In this section, we will be showcasing the Dirichlet Distribution and using D3.js {cite:p}`bostock2011d3` to provide illustrations for concepts. " ] }, { "cell_type": "code", "execution_count": 9, "id": "ad450696-5f96-4bd9-a1f7-c1a44b5bfe20", "metadata": { "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "from IPython.display import HTML\n", "\n", "def load_d3_in_cell_output():\n", " display(HTML(\"\"))\n", "get_ipython().events.register('pre_run_cell', load_d3_in_cell_output)" ] }, { "cell_type": "markdown", "id": "129bb8b0-4830-4dbc-9873-e2509fb66ae7", "metadata": {}, "source": [ "## The Chinese Restaurant Process" ] }, { "cell_type": "markdown", "id": "c39b1569-bfc7-41ec-89f7-3c87a4103365", "metadata": {}, "source": [ "In the thought problem, we will be examing a situation where a hungry person (πŸ€”) enters a restaurant and needs to choose a table (βšͺ).\n", "\n", "This original was developed by {cite:p}`aldous1985exchangeability` and a great resource to consider is Pasupat's ({cite:p}`Pasupat_2021`).\n", "\n", "Here are the ground rules for this thought problem.\n", " " ] }, { "cell_type": "markdown", "id": "af48f15f-7d3b-444a-b745-8e14c895dff7", "metadata": {}, "source": [ "## Rules for Our Thought Problem" ] }, { "cell_type": "markdown", "id": "802c2fbf-1898-4c04-90a0-7ace8baedd52", "metadata": {}, "source": [ "### 1. An Infinite Amount of Tables (βšͺ)\n", "\n", "We are depicting five tables (βšͺβšͺβšͺβšͺβšͺ), but we need to consider a situation where the number of tables is infinite. \n", "\n", "* βšͺ = ∞" ] }, { "cell_type": "markdown", "id": "7342c7ba-74d9-4ece-ba1b-4dc6576b827f", "metadata": {}, "source": [ "### 2. A Hungry Person (πŸ€”) Only Two Options\n", "\n", "When a hungry person (πŸ€”) walks into the restaurant they have two options: \n", " \n", "* Either they sit a table (βšͺ) with someone else (πŸ˜ƒ) \n", "* or pick a new table (βšͺ) \n", "\n", "To simplify this, here a decision chart. " ] }, { "cell_type": "code", "execution_count": 10, "id": "05be4377-481c-4f6a-8885-54e82f716645", "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/svg+xml": [ "A hungry πŸ€” walks into the restaurantA hungry πŸ€” walks into the restaurantSit at a βšͺ with other πŸ˜ƒSit at a βšͺ with other πŸ˜ƒSit at a new βšͺSit at a new βšͺ" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import SVG, display\n", "display(SVG(url='https://raw.githubusercontent.com/dudaspm/LDA_Bias_Data/main/images/startCondition.svg'))" ] }, { "cell_type": "markdown", "id": "d2b89149-0e89-4dfb-9c67-513764ed4526", "metadata": {}, "source": [ "And to further reduce this down, we will be using this:" ] }, { "cell_type": "markdown", "id": "e160df99-5cce-4a39-af1d-ceeb947ea3a3", "metadata": {}, "source": [ "### 3. Many βšͺ & πŸ˜ƒ, Only One Empty βšͺ\n", "\n", "This goes with #2, but in our scenario, there is the number of tables (βšͺ) with people (πŸ˜ƒ), but when considering an empty table (βšͺ). We will only consider *one* of the infinite number of tables (βšͺ) open. Another way to consider this is either a hungry person (πŸ€”):\n", "* sits at the *one of possible many* tables (βšͺ) with someone else (πŸ˜ƒ) \n", "* *OR* they sit at the *one* new table (βšͺ)" ] }, { "cell_type": "markdown", "id": "33113b93-32e2-450f-a5fb-13cf335ab3e8", "metadata": {}, "source": [ "### All Tables (βšͺ) are Equal\n", "Notice that all the tables are equal distance away. So, there is no weighting based on the distance, and each table is equally likely to be picked. " ] }, { "cell_type": "code", "execution_count": 11, "id": "ac1b59d4-c268-464a-b087-8964b9c2547a", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "782ab924-62be-4ad3-a67d-0df68a631478", "metadata": {}, "source": [ "### Key for Thought Problem\n", "\n", "> πŸ€” - hungry person\n", "* The person who needs to find a seat at a table\n", "\n", "> πŸ˜ƒ - person eating\n", "* A person already at a table\n", "\n", "> βšͺ - a possible table\n", "* A potential seat for the hungry person to sit at\n", "\n", "> ⚫ - a not possible table \n", "* Not a potential seat for the hungry person to sit at (see Rule #3)." ] }, { "cell_type": "markdown", "id": "341b4578-6590-423e-862d-426046664b41", "metadata": {}, "source": [ "## All Solutions πŸ’₯TO THE EXTREMEπŸ’₯" ] }, { "cell_type": "markdown", "id": "387dc94a-1424-46c5-9db5-16dbaccf0e46", "metadata": {}, "source": [ "Now that we have our ground rules let's approach this problem from, what I am calling, the extreme positions. We have not mentioned a single bit of math up to this point, but this section will contain conversations around probabilities. Here are three scenarios for our extreme positions. \n", "\n", "1. The Social Butterfly\n", "2. The Gambler\n", "3. The Long Day" ] }, { "cell_type": "markdown", "id": "bec17d1f-d5ba-4a01-a740-b414c13e2f47", "metadata": {}, "source": [ "### 1. The Social Butterfly\n", "\n", "The social butterfly assumes every person that enters the restaurants wants to sit at the table with the most people. " ] }, { "cell_type": "code", "execution_count": 12, "id": "8b03680d-d20e-42c6-bea6-1fbf5e658b60", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "433e57bf-1419-48fd-b152-3926a9e0f635", "metadata": {}, "source": [ "The following person (πŸ€”) walks in and sits at the most popular table. " ] }, { "cell_type": "code", "execution_count": 13, "id": "51934260-b24a-4212-b3f7-83dbe283931b", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "9d8ff65c-4e87-4671-8b5a-cb1e0fbf7757", "metadata": {}, "source": [ "and repeat this process for the next three customers (πŸ€”)." ] }, { "cell_type": "code", "execution_count": 14, "id": "2175f40b-0e1b-454e-93da-c75686e3a859", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "ef83f290-8744-4a3c-bd23-9478abb82a8e", "metadata": {}, "source": [ "### 2. The Gambler\n", "\n", "The Gambler is the person who only cares about the probabilities. Meaning, if there are two tables (βšͺ), then they have a 50/50 choice, and they do not care at all about the people sitting there or not. " ] }, { "cell_type": "code", "execution_count": 15, "id": "2578cef5-bfb5-4213-afb4-f4d08524bd35", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "4f164bf4-021a-4025-898f-71795296c65b", "metadata": {}, "source": [ "Where the probability is now $p = \\frac{1}{2}$" ] }, { "cell_type": "code", "execution_count": 16, "id": "d0445c09-ef73-4f4b-9946-cb23e7b97b96", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "077009c8-bd5a-451c-9323-7c7a8d564d8a", "metadata": {}, "source": [ "Now $p = \\frac{1}{3}$" ] }, { "cell_type": "code", "execution_count": 17, "id": "2288ced5-f72d-4688-8721-da87ee0cafa5", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "ae348f82-fa58-4b29-bbd1-7b6ef9280ce4", "metadata": {}, "source": [ "Then probability is now $p = \\frac{1}{4}$" ] }, { "cell_type": "code", "execution_count": 18, "id": "46a6e553-ea93-46ee-a509-04bb92c8b79c", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "00943292-f3c7-4e93-aa62-8835c0b112a3", "metadata": {}, "source": [ "Finally, all tables (βšͺ) are probability $p = \\frac{1}{5}$" ] }, { "cell_type": "code", "execution_count": 19, "id": "7dfc6ce6-f9a6-41b5-8281-876d6a04d196", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "ca29cf87-050e-444a-b988-dd377d5c202a", "metadata": {}, "source": [ "### 3. The Long Day\n", "\n", "The Long Day scenario describes a situation where customers (πŸ€”) coming into the restaurant had a reeeeeeeeeeeeeeeally long day. All they want is a table (βšͺ) to themselves to eat their food, pay, and go home. This scenario is the opposite of the Social Butterfly, where people are at a table (πŸ˜ƒ & βšͺ). They will find an empty table (βšͺ).\n" ] }, { "cell_type": "code", "execution_count": 20, "id": "ddf03ba3-0919-4019-b48a-3a2288c17664", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "5bf1052e-6817-48f1-8a6b-8b5588ab09d9", "metadata": {}, "source": [ "With this selection, the customer (πŸ€”) will always pick the new table. " ] }, { "cell_type": "code", "execution_count": 21, "id": "e5e51927-29a5-4d52-b15c-ed7d34eab11f", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "ab9ad2e4-8c75-4fff-9ed0-cbb735bb19e9", "metadata": {}, "source": [ "Repeat for all customers (πŸ€”)." ] }, { "cell_type": "code", "execution_count": 22, "id": "8f51be78-d665-4494-990f-3724d3f18362", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "230ae6ae-3ab8-4633-accd-e0dd4540c437", "metadata": {}, "source": [ "## The Conclusions\n", "\n", "### ✨1st Conclusion✨\n", "\n", "So, let's take a look at all three of these scenario results." ] }, { "cell_type": "code", "execution_count": 38, "id": "d421d13c-b5e2-4751-b13d-7bd157053218", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "9a29dac2-de22-481a-944e-882d1d14ea70", "metadata": {}, "source": [ "Our ✨1st Conclusion✨ is that for each scenario, the total probabilities (when added together) equal 1. This conclusion is our first connection to the *Dirichlet Distribution*. " ] }, { "cell_type": "markdown", "id": "15369812-5388-47d2-bd4a-fe5de80e8c81", "metadata": {}, "source": [ "```{admonition} Dirichlet Distribution Always Sum to 1\n", ":class: tip\n", "Regardless of the number of tables (βšͺ), the number of people at the tables (πŸ˜ƒ), or a hungry persons' (πŸ€”) strategy. The total probability will be 1. This concept is also considered to be a *probability mass function* or PMF property. \n", "```" ] }, { "cell_type": "markdown", "id": "4e850ab8-697d-4ffd-9c02-0be4485080aa", "metadata": { "tags": [] }, "source": [ "### ✨2nd Conclusion✨\n", "\n", "This easiest to see with our \"The Gambler\" scenerio. " ] }, { "cell_type": "code", "execution_count": 24, "id": "7f436473-c6e6-4ede-83b7-846979cff2cf", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "f7e61d25-5eff-4691-a9cd-ab374f226abf", "metadata": { "tags": [] }, "source": [ "```{admonition} When All Possibility Are Equally Likely\n", ":class: tip\n", "In situations where are all possibilities are equally likely (equally likely to sit at a table with someone else (βšͺ&πŸ˜ƒ) or sit at a new table (βšͺ)), we can abbreviate this to a simple probablity:\n", "\n", "$\\frac{πŸ˜ƒ}{πŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ€”}$ \n", "$ = $\n", "$\\frac{\\text{Number of people sitting at table (βšͺ&πŸ˜ƒ)}}{\\text{All people (πŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ€”)}}$ \n", "$ = $\n", "$\\frac{N_j}{N}$\n", "\n", "AND \n", "\n", "$\\frac{πŸ€”}{πŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ€”}$ \n", "$ = $\n", "$\\frac{\\text{Number of people who can sit at a new table (βšͺ)}}{\\text{All people (πŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ€”)}}$ \n", "$ = $\n", "$\\frac{1}{N}$\n", "```" ] }, { "cell_type": "markdown", "id": "2be6b1e5-8c63-4de7-96cf-1eaec00fc394", "metadata": {}, "source": [ "```{admonition} To Join Or Not To Join\n", ":class: tip\n", "As shown in the animation, there are two conditions: to join *or* not to join. Both of these take advantage of the $\\frac{N_j}{N}$ relationship, but it can be seen that as tables are filled, the more likely this could occur. Meaning... \n", "\n", "*To join*\n", "\n", "$\\frac{πŸ˜ƒ}{πŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ€”}$ \n", "$ + $\n", "$\\frac{πŸ˜ƒ}{πŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ€”}$ \n", "$ + $\n", "$\\frac{πŸ˜ƒ}{πŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ€”}$ \n", "$ + $\n", "$\\frac{πŸ˜ƒ}{πŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ€”}$ \n", "$ + $\n", "$ = $\n", "$\\frac{N-1}{N}$ \n", "\n", "*Or Not To Join*\n", "\n", "$\\frac{πŸ€”}{πŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ€”}$ \n", "$ = $\n", "$\\frac{1}{N}$ \n", "```" ] }, { "cell_type": "markdown", "id": "f51f26f7-cf6c-42d1-bdb1-762a4f49fc52", "metadata": {}, "source": [ "To expand on this idea, we need to look at more of a real situation. Where someone entering the restaurant is making a decision and not the extremes. In the visualization below, click on the tables to add people to that table. We are introducing a new concept as well. That being a probability (p). The first table, before anyone is seated, will be $\\frac{p}{p} = 1 $. Meaning, if we change this probability to any number, the math will always work out. Then as we add people to tables, we will keep expanding on this introduction of the probability. \n", "\n", "Remember, for each table, we are saying $\\frac{N_j}{N}$, where $N_j = \\text{Number of people sitting at table j (}βšͺ_j\\text{&πŸ˜ƒ)}$ and $N = \\text{All people (πŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ€”)}$ = $\\frac{\\text{Number of people sitting at table (βšͺ&πŸ˜ƒ)}}{\\text{All people (πŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ€”)}}$ \n", "\n", "Also, $\\frac{\\text{Number of people who can sit at a new table (βšͺ)}}{\\text{All people (πŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ˜ƒπŸ€”)}}$ $ = \\frac{1}{N}$\n", "\n", "or simply\n", "$\\frac{N_j}{N} + \\frac{1}{N} = 1$\n" ] }, { "cell_type": "markdown", "id": "7130d92d-9b21-40b2-9cac-abe20e4245c3", "metadata": {}, "source": [ "```{note} \n", "In the next animation, click on the circles to add new people. \n", "```" ] }, { "cell_type": "code", "execution_count": 25, "id": "2f99db35-11fb-498c-bdba-9cc12a23a1e7", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "c1ae54d2-0548-4459-aa42-fb458f8d1d55", "metadata": {}, "source": [ "This lines up perfectly with what we specified beforehand. \n", "\n", "$\\frac{N_0}{N+p}+\\frac{N_1}{N+p}+\\frac{N_2}{N+p}+\\frac{N_3}{N+p}+\\frac{N_4}{N+p}+\\frac{N_5}{N+p}+\\frac{N_6}{N+p}+\\frac{N_7}{N+p}+\\frac{N_8}{N+p}+\\frac{p}{N+p} = $\n", "\n", "$\\frac{N_0+N_1+N_2+N_3+N_4+N_5+N_6+N_7+N_8+p}{N+p} = $ where $N_0+N_1+N_2+N_3+N_4+N_5+N_6+N_7+N_8 = N$\n", "\n", "$\\frac{N+p}{N+p} = 1$\n", "\n", "or, even better\n", "\n" ] }, { "cell_type": "markdown", "id": "d68621cb-85c6-430c-8c16-fd0d5341d7a8", "metadata": {}, "source": [ "```{admonition} The Predictive Probability\n", ":class: tip\n", "The Dirichlet has the predicitve probability of \n", "\n", "$\\frac{N_j}{N+p} + \\frac{p}{N+p} = 1$\n", "\n", "Where, traditionally, you will see p written as alpha ($\\alpha$).\n", "\n", "So...\n", "$\\frac{N_j}{N+\\alpha} + \\frac{\\alpha}{N+\\alpha} = 1$\n", "\n", "```" ] }, { "cell_type": "markdown", "id": "94fff499-79c1-4ee4-a47e-1d1edf54a77d", "metadata": {}, "source": [ "### πŸ’° The Rich Get Richer πŸ’°\n", "\n", "When dealing with the Dirichlet process, we are dealing with probabilities. As before, p or $\\alpha$, represents a probability. In this case, the person entering will more than likely want to sit at a table and, more specifically, the table with the most people. This will rarely be as extreme as \"The Social Butterfly,\" but instead be represented by the $\\frac{N_j}{N+\\alpha} + \\frac{\\alpha}{N+\\alpha} = 1$ for each table. \n", "\n", "In this next animation, we will simulate the same experience as above but more realistic to the Dirichlet process. " ] }, { "cell_type": "code", "execution_count": 35, "id": "4284fa30-ec71-4553-8dd4-a298e3df3c88", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "

Changing Alpha Values

\n", "\n", "
\n", "\n", "
\n", "\n", " .2\n", " 1\n", " 5\n", "
\n", "
\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "

Changing Alpha Values

\n", "\n", "
\n", "\n", "
\n", "\n", " .2\n", " 1\n", " 5\n", "
\n", "
\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "71ce6c84-ffa5-4c75-aa74-7604408d39f6", "metadata": {}, "source": [ "## Actual Data\n", "\n", "To see some actual data, we will be using scipy for five topics and five documents.\n", "\n", "Note to print these 5 x 5 tables, we used code from SlackOverflow {cite:p}`table_jupyter_stackoverflow`.\n" ] }, { "cell_type": "code", "execution_count": 27, "id": "5285ec9d-a50e-428a-8a79-a287d0ec84b9", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from scipy.stats import dirichlet\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "7e3ecbad-506f-4a3e-95cf-76b8cf23b256", "metadata": {}, "source": [ "### alpha = .01" ] }, { "cell_type": "code", "execution_count": 28, "id": "e3f81948-bc89-4701-91cf-487011a06668", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
topic 0topic 1topic 2topic 3topic 4
document 00.01.00.00.00.0
document 10.20.80.00.00.0
document 21.00.00.00.00.0
document 30.00.00.00.01.0
document 40.00.00.00.01.0
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import HTML, display\n", "\n", "alpha = np.array([0.01, 0.01, 0.01, 0.01, 0.01])\n", "data_from_dirichlet = np.around(dirichlet.rvs(alpha, size=5), decimals=1).tolist()\n", "data_for_output = []\n", "\n", "temp = []\n", "temp.append(\"\")\n", "\n", "for i in range(5):\n", " temp.append(\"topic %s\" % (i))\n", "\n", "data_for_output.append(temp) \n", "for i in range(len(data_from_dirichlet)):\n", " temp = []\n", " temp.append(\"document %s\" % (i))\n", " for j in data_from_dirichlet[i]:\n", " temp.append(j)\n", " data_for_output.append(temp)\n", "\n", "display(HTML(\n", " '{}
'.format(\n", " ''.join(\n", " '{}'.format(''.join(str(_) for _ in row)) for row in data_for_output)\n", " )\n", "))" ] }, { "cell_type": "markdown", "id": "c53afdb1-6dc7-42a4-8264-664a9edc8068", "metadata": {}, "source": [ "### alpha = .1" ] }, { "cell_type": "code", "execution_count": 29, "id": "ed00b52a-908a-4657-a2bf-eb413224f59b", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
topic 0topic 1topic 2topic 3topic 4
document 00.10.00.90.00.0
document 10.00.01.00.00.0
document 20.00.10.90.00.0
document 30.80.10.10.00.0
document 40.00.00.00.80.2
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "alpha = np.array([0.1, 0.1, 0.1, 0.1, 0.1])\n", "\n", "data_from_dirichlet = np.around(dirichlet.rvs(alpha, size=5), decimals=1).tolist()\n", "data_for_output = []\n", "\n", "temp = []\n", "temp.append(\"\")\n", "\n", "for i in range(5):\n", " temp.append(\"topic %s\" % (i))\n", "\n", "data_for_output.append(temp) \n", "for i in range(len(data_from_dirichlet)):\n", " temp = []\n", " temp.append(\"document %s\" % (i))\n", " for j in data_from_dirichlet[i]:\n", " temp.append(j)\n", " data_for_output.append(temp)\n", "\n", "display(HTML(\n", " '{}
'.format(\n", " ''.join(\n", " '{}'.format(''.join(str(_) for _ in row)) for row in data_for_output)\n", " )\n", "))" ] }, { "cell_type": "markdown", "id": "2fbe7fbd-7d41-4444-807d-620dae4c60f2", "metadata": {}, "source": [ "### alpha = 1" ] }, { "cell_type": "code", "execution_count": 30, "id": "70c95cc5-8748-4b54-8d39-821e662249cf", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
topic 0topic 1topic 2topic 3topic 4
document 00.00.30.20.30.3
document 10.60.10.10.10.1
document 20.50.10.20.10.2
document 30.00.30.30.30.1
document 40.20.10.20.30.1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "alpha = np.array([1, 1, 1, 1, 1])\n", "\n", "data_from_dirichlet = np.around(dirichlet.rvs(alpha, size=5), decimals=1).tolist()\n", "data_for_output = []\n", "\n", "temp = []\n", "temp.append(\"\")\n", "\n", "for i in range(5):\n", " temp.append(\"topic %s\" % (i))\n", "\n", "data_for_output.append(temp) \n", "for i in range(len(data_from_dirichlet)):\n", " temp = []\n", " temp.append(\"document %s\" % (i))\n", " for j in data_from_dirichlet[i]:\n", " temp.append(j)\n", " data_for_output.append(temp)\n", "\n", "display(HTML(\n", " '{}
'.format(\n", " ''.join(\n", " '{}'.format(''.join(str(_) for _ in row)) for row in data_for_output)\n", " )\n", "))\n" ] }, { "cell_type": "markdown", "id": "5fc6c589-007c-4dc8-9d92-273481115601", "metadata": {}, "source": [ "### alpha = 10" ] }, { "cell_type": "code", "execution_count": 31, "id": "a13d4256-d8fe-41ff-9c07-ef7b7c6ed07a", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
topic 0topic 1topic 2topic 3topic 4
document 00.30.10.20.20.2
document 10.20.30.20.20.1
document 20.20.20.20.20.2
document 30.30.10.20.20.2
document 40.20.20.30.20.1
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "alpha = np.array([10, 10, 10, 10, 10])\n", "\n", "data_from_dirichlet = np.around(dirichlet.rvs(alpha, size=5), decimals=1).tolist()\n", "data_for_output = []\n", "\n", "temp = []\n", "temp.append(\"\")\n", "\n", "for i in range(5):\n", " temp.append(\"topic %s\" % (i))\n", "\n", "data_for_output.append(temp) \n", "for i in range(len(data_from_dirichlet)):\n", " temp = []\n", " temp.append(\"document %s\" % (i))\n", " for j in data_from_dirichlet[i]:\n", " temp.append(j)\n", " data_for_output.append(temp)\n", "\n", "display(HTML(\n", " '{}
'.format(\n", " ''.join(\n", " '{}'.format(''.join(str(_) for _ in row)) for row in data_for_output)\n", " )\n", "))\n" ] }, { "cell_type": "markdown", "id": "a925ea02-8f5a-4121-b2fe-3512535f0674", "metadata": {}, "source": [ "### alpha = 100" ] }, { "cell_type": "code", "execution_count": 32, "id": "69a06ed2-d7e1-4e2b-a7e2-d40d502cf516", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
topic 0topic 1topic 2topic 3topic 4
document 00.20.20.20.20.2
document 10.20.20.20.20.2
document 20.20.20.20.20.2
document 30.20.20.20.20.2
document 40.20.20.20.20.2
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "alpha = np.array([100, 100, 100, 100, 100])\n", "\n", "data_from_dirichlet = np.around(dirichlet.rvs(alpha, size=5), decimals=1).tolist()\n", "data_for_output = []\n", "\n", "temp = []\n", "temp.append(\"\")\n", "\n", "for i in range(5):\n", " temp.append(\"topic %s\" % (i))\n", "\n", "data_for_output.append(temp) \n", "for i in range(len(data_from_dirichlet)):\n", " temp = []\n", " temp.append(\"document %s\" % (i))\n", " for j in data_from_dirichlet[i]:\n", " temp.append(j)\n", " data_for_output.append(temp)\n", "\n", "display(HTML(\n", " '{}
'.format(\n", " ''.join(\n", " '{}'.format(''.join(str(_) for _ in row)) for row in data_for_output)\n", " )\n", "))\n" ] }, { "cell_type": "markdown", "id": "48e5f5c1-d232-492e-b488-8914740adf5c", "metadata": {}, "source": [ "## Force-Directed Graph + Dirichlet Example\n", "\n", "To tie this conversation together, we coupled the force-directed graph and Dirichlet. The first set of code (hidden) created the original dataset. The force-directed graph highlights, as you change the alpha, how the documents shift in space. " ] }, { "cell_type": "code", "execution_count": 33, "id": "e4b382cf-6a53-4d8d-acdf-0830cc2879c2", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import json\n", "\n", "alphas = [.001,.005,.01,.05,.1,.5,1,1.5,5,10,100]\n", "\n", "output = {}\n", "\n", "for a in alphas:\n", " b = [a] * 20\n", " alpha = np.array(b)\n", " data_from_dirichlet = dirichlet.rvs(alpha, size=10, random_state=53155618)\n", " data_from_dirichlet = np.around(data_from_dirichlet, decimals=3).tolist()\n", " output[a] = data_from_dirichlet\n", " \n", " \n", "#with open('dirichlet.json', 'w') as outfile:\n", "# json.dump(output, outfile)\n", " " ] }, { "cell_type": "code", "execution_count": 36, "id": "7d771505-9ded-4951-b77c-40b6abb5d6b5", "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

Changing Alpha Values

\n", " .001\n", " .005\n", " .01\n", " .05\n", " .1\n", " .5\n", " 1\n", " 1.5\n", " 5\n", " 10\n", " 100\n", "\n", "

\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "

Changing Alpha Values

\n", " .001\n", " .005\n", " .01\n", " .05\n", " .1\n", " .5\n", " 1\n", " 1.5\n", " 5\n", " 10\n", " 100\n", "\n", "

\n", "
\n", "" ] }, { "cell_type": "markdown", "id": "85eca051-28b4-4345-adcd-ab83c53de0b4", "metadata": {}, "source": [ "## Next we provide a use-case for LDA. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 5 }